COVID-19 deaths in the U.S. have surpassed 256,000, and the virus is now the third leading cause of death (COD) in this country, after heart disease and cancer. Before the pandemic, the U.S. already had a high overall mortality rate, and the gap has widened in the last few decades. In this analysis, we put the pandemic’s toll into perspective by showing where COVID-19 ranks among the leading causes of death in the U.S. and how it has affected the number of deaths from the other 12 causes. The purpose of this analysis is to understand the burden of mortality that is directly or indirectly attributable to COVID-19, and specifically whether COVID-19 had any effect on the number of deaths from the other 12 causes.
The CDC National Center for Health Statistics (NCHS) collects weekly counts of deaths by state and by select causes, categorized by the underlying cause of death using the standardized ICD-10 code groupings. From 2014 to 2019, there were 12 main COD listed, including leading U.S. killers such as diseases of the heart, diabetes, and lower respiratory diseases. With the onset of COVID-19, two more causes were added: COVID-19 Multiple Cause of Death and COVID-19 Underlying Cause of Death (see Figure 1).
Figure 1: Example Death Certificate
We used three sources of data for this analysis: Weekly Counts of Deaths by State and Select Causes, 2014-2018; Weekly Counts of Deaths by State and Select Causes, 2019-2020; and U.S. 2020 Population Density. We combined the CDC weekly death counts into a single dataset that represents provisional counts of deaths by the week the deaths occurred, by state of occurrence, and by select underlying causes of death from 2014 to 2020. The dataset also includes weekly provisional counts of deaths for COVID-19, coded to ICD-10 code U07.1 as an underlying or multiple cause of death (see Figure 1).
It is worth noting that our project studies provisional deaths, not final deaths. Provisional counts are subject to change as more information becomes available and have been known to change even a year after the initial death certificate is received. Even with this caveat, provisional counts are the best available data for analyzing current-year deaths. In our project we used a six-week cutoff for our data. While not perfect, it allowed us to include data that may be reported on a monthly basis, plus the extra time that COVID-19 deaths often take to process. As a result, counts for dates closer to the present may be less accurate than counts from earlier months. More information on how the CDC counts deaths can be found on their website.
The source datasets were fairly clean, so only minimal pre-processing was necessary. We saved the cause of death data in two formats, wide and long, where the long format has one row per cause of death for a given state and week. Here are the cleanup steps we took on the source data:
Causes of Death Dataset Cleanup:
U.S. Population Density State Cause of Death Dataset Cleanup:
The COD dataset has 352 rows and 14 columns, where the first column captures the week date. To give you an idea of what the combined cause of death data looked like, here are the first 10 rows that show COVID-19 deaths:
The U.S. population density state causes of death dataset has 239,954 rows and 15 columns. This is in a long format, where each row represents a COD for that given week and state. Here is a sample of the data:
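The wide-to-long reshaping described earlier can be sketched with pandas. This is only an illustration of the idea, not the project's actual cleanup code: the column names (`state`, `week_ending`, `heart_disease`, `diabetes`) and all counts are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical wide-format rows: one row per state and week, one column per cause.
# Column names and counts are illustrative, not taken from the CDC files.
wide_df = pd.DataFrame(
    {
        "state": ["Alabama", "Alabama", "Alaska"],
        "week_ending": ["2020-04-04", "2020-04-11", "2020-04-04"],
        "heart_disease": [250, 262, 18],
        "diabetes": [30, 34, 2],
    }
)

# Reshape to long format: one row per cause of death for a given state and week.
long_df = wide_df.melt(
    id_vars=["state", "week_ending"],
    var_name="cause",
    value_name="deaths",
)

print(long_df)
```

Each wide row with two cause columns becomes two long rows, which is why the long dataset has far more rows than the wide one.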
To get a sense of the distribution of the deaths across causes you can look at Figure 2, where we show the total death count for the years prior to COVID-19 (2014-2019) versus how it looks in the year 2020. Heart disease and cancer clearly have a large percentage of the deaths, and you can see the impact when COVID-19 showed up in 2020.
Figure 2: Total deaths by cause
To get an overall feel for COVID-19 deaths across the U.S., Figure 3 shows the regional distribution. This figure has death counts normalized by population (per capita) to make a fairer comparison. Hovering over an individual state shows the state’s name, its number of COVID-19 deaths, and its number of COVID-19 deaths relative to its population. By calculating COVID-19 deaths per capita, we can see which states are being hit harder and look for regional and subregional patterns with further analysis. Looking at Figure 3, we can see a strong concentration of per capita COVID-19 deaths in the South and Northeast regions.
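The per capita normalization is simple arithmetic, sketched below. The scaling to deaths per 100,000 residents is an assumption made here for readability (the figure may use a different scale), and the two states and their numbers are made up.

```python
def deaths_per_100k(deaths: int, population: int) -> float:
    """Return deaths per 100,000 residents (the scale is an assumed convention)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population


# Hypothetical inputs for two made-up states (not real figures).
states = {
    "State A": {"covid_deaths": 5_000, "population": 10_000_000},
    "State B": {"covid_deaths": 1_200, "population": 1_000_000},
}

rates = {
    name: deaths_per_100k(values["covid_deaths"], values["population"])
    for name, values in states.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 100,000")
```

In this made-up example State A has more deaths in total, but State B is hit harder once population is taken into account.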
Figure 3: COVID-19 deaths across U.S. per capita
The main focus of our analysis was comparing the COD over time. In other words, we looked for a correlation over time between one COD and another, with special focus on COVID-19. We used four strategies to address this question: Pearson correlation coefficients, Granger causality, predicted deaths, and descriptive visualizations such as time series plots. The time series plots started our analysis by revealing interesting patterns. We then took a more statistical approach, which helped provide evidence of a relationship between COD and COVID-19 and informed our analytical focus. Finally, using the evidence from the previous steps, we looked to see whether the patterns extend to individual states and regions.
We started our analysis with a simple time series plot showing the years 2019-2020 (see Figure 4). We wanted to see if any unusual patterns jumped out at us. The first one that is easy to see is the fast rise of COVID-19 deaths at the beginning of 2020. We see a few other obvious changes: most COD had a small spike when COVID-19 first started, suggesting either that the other COD happened to increase at the same time or that there is a more interesting correlation between COVID-19 and the other COD. Heart disease shows the largest spike during the early stages of the COVID-19 pandemic. There is also an odd rise in unknown COD, which is still continuing today.
Figure 4: Causes of Death from 2019-2020
Another way of viewing the COD over time is by percentage: what proportion of all deaths is related to each COD (see Figure 5). This stacked view of the data doesn’t show total counts, but does show the disruption COVID-19 caused to the percentages.
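A minimal sketch of the percentage calculation is shown below, assuming long-format data with one row per week and cause. The column names and counts are hypothetical. The example also shows why a cause's share can fall even while its count rises.

```python
import pandas as pd

# Hypothetical long-format counts for two weeks (illustrative values only).
long_df = pd.DataFrame(
    {
        "week_ending": ["2020-03-07"] * 3 + ["2020-04-11"] * 3,
        "cause": ["heart_disease", "cancer", "covid_19"] * 2,
        "deaths": [600, 400, 0, 660, 400, 940],
    }
)

# Total deaths per week, repeated on every row of that week.
weekly_total = long_df.groupby("week_ending")["deaths"].transform("sum")

# Share of that week's deaths attributed to each cause, in percent.
long_df["pct_of_deaths"] = long_df["deaths"] / weekly_total * 100

print(long_df)
```

In these made-up numbers, heart disease deaths rise from 600 to 660, yet their share drops from 60% to 33% because the weekly total doubles.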
Figure 5: Causes of Death Percentage from 2019-2020
We then started looking at statistical methods to evaluate correlations between COD over time. If we were not dealing with time series data, the default correlation analysis would be Pearson correlation coefficients. However, time series data often has random-walk characteristics, which can lead to spurious correlations (correlations that are not real). You can usually overcome this by differencing each series, that is, replacing each value with its change from the previous (lagged) value. That is the technique we used in our analysis of the year 2020, when COVID-19 death counts were introduced. You can see the Pearson correlation results in Figure 6, where the range is -1 to 1, with values close to 0 indicating weak correlations. The COD that shows a strong correlation is heart disease (0.7), and two that show moderate correlation are diabetes (0.6) and Alzheimer’s (0.5).
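The differencing step can be sketched as follows. This is a minimal illustration assuming first differences (week-over-week changes) and NumPy. The function name `differenced_pearson` and the weekly counts are hypothetical, and this is not the project's actual code.

```python
import numpy as np


def differenced_pearson(a, b):
    """Pearson correlation of the week-over-week changes in two equal-length series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape or a.size < 3:
        raise ValueError("series must have the same length and at least 3 points")
    # First differences remove the level of each series and keep only its changes.
    da = np.diff(a)
    db = np.diff(b)
    return float(np.corrcoef(da, db)[0, 1])


# Hypothetical weekly death counts (illustrative values only).
covid = [0, 10, 50, 120, 150, 130, 90]
heart = [600, 605, 630, 670, 680, 672, 650]

r = differenced_pearson(covid, heart)
print(f"Pearson r on differenced series: {r:.2f}")
```

Correlating the changes rather than the raw levels means two series that merely trend in the same direction are less likely to look related.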
Figure 6: Pearson correlation coefficient for weeks with COVID-19 deaths
One of the most powerful ways to identify a correlation between time series is the Granger causality test. It is related to the time series modeling technique called vector autoregression (VAR). VAR is a multivariate time series technique that predicts future values using two or more autoregressive variables. In simple terms, the lagged values of two time series are used together to predict one of them. For example, if you know yesterday’s prices of both gold and oil, you can predict the next gold price better than if you knew only the previous gold price. Applied to this analysis: does the COVID-19 death count help predict the death count of another COD? This also applies the other way around: does another COD help predict COVID-19? This is where the “causality” part of the name comes from, because you can test for directional correlation. However, do not be misled by the term “causality”: this test does not prove causality, only correlation. A little bit of trivia for those interested: Clive Granger, who developed the test, shared the 2003 Nobel Memorial Prize in Economic Sciences with Robert F. Engle for their work on methods for analyzing economic time series.
There are just a few more process steps to share before showing the results. First, only weeks with COVID-19 deaths are used, since that is our primary interest in this analysis. Second, the number of autoregressive terms was selected automatically by choosing the best lag from up to 10 possible lags. The number of lags is important because, as mentioned previously, we need to build the best VAR model we can for each pair of time series. The results of our Granger analysis can be seen in Figure 7, where we show all Granger causality relationships with COVID-19. Note that the p-value determines whether a relationship is significant (the 0.05 threshold is marked by a vertical reference line). Based on the results of the Granger causality test, COVID-19 was significant in predicting the other COD in all cases except cancer and unknown causes. This is fairly strong evidence of a correlation between the other COD and COVID-19 over time.
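To make the idea behind the test concrete, here is a hand-rolled sketch of the F-statistic that a Granger causality test is built on, using only NumPy. It compares a model that predicts a series from its own lags against a model that also includes lags of a second series. This is an illustration under simplifying assumptions (a fixed lag count, no automated lag selection, no conversion to a p-value), run on synthetic data. It is not the project's code, and in practice a library routine such as statsmodels' `grangercausalitytests` would normally be used.

```python
import numpy as np


def lag_matrix(series, lags):
    """Columns hold series[t-1], ..., series[t-lags] for t = lags .. n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])


def granger_f_stat(target, predictor, lags):
    """F-statistic for H0: lags of `predictor` add nothing beyond lags of `target`."""
    y = np.asarray(target, dtype=float)
    x = np.asarray(predictor, dtype=float)
    n_obs = len(y) - lags
    dependent = y[lags:]

    # Restricted model: constant + own lags. Unrestricted: also lags of the predictor.
    restricted = np.hstack([np.ones((n_obs, 1)), lag_matrix(y, lags)])
    unrestricted = np.hstack([restricted, lag_matrix(x, lags)])

    def ssr(design):
        coef, *_ = np.linalg.lstsq(design, dependent, rcond=None)
        resid = dependent - design @ coef
        return float(resid @ resid)

    ssr_r, ssr_u = ssr(restricted), ssr(unrestricted)
    df_denom = n_obs - unrestricted.shape[1]
    return ((ssr_r - ssr_u) / lags) / (ssr_u / df_denom)


# Synthetic example: y follows x with a one-step delay, so x should help predict y.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=199)

f_x_to_y = granger_f_stat(y, x, lags=2)  # does x help predict y?
f_y_to_x = granger_f_stat(x, y, lags=2)  # does y help predict x?
print(f"x -> y: F = {f_x_to_y:.1f}; y -> x: F = {f_y_to_x:.1f}")
```

The p-value comes from comparing this statistic against an F distribution with `lags` and `df_denom` degrees of freedom. Running the test in both directions is what gives the directional reading described above.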
Figure 7: Granger causality for weeks with COVID-19 deaths
Another way to show correlations is to predict what the death count for each COD should have been in 2020 and then compare the prediction with the actual death count. If we see unusual behavior during the COVID-19 pandemic, that suggests there is a correlation. For that purpose, we first created a predictive model for each non-COVID-19 COD, using a time series modeling technique called autoregressive integrated moving average (ARIMA). Once the models were built, we predicted what the death counts should have been in 2020. Comparing the predictions and their 95% confidence intervals with the actual death counts gives us Figure 8. Quite a few COD show some unusual activity: Alzheimer’s, heart disease, diabetes, cerebrovascular, kidney disease, other respiratory, and unknown causes. The most unusual patterns can be seen with Alzheimer’s, diabetes, heart disease, and unknown causes. This gives us additional evidence that COVID-19 is correlated with these COD, as was identified in our Pearson correlation (see Figure 6) and Granger analysis (see Figure 7). Unknown causes has a pattern that can’t be interpreted directly, given the wide range of conditions it covers, and we aren’t completely sure why it shows such unusual behavior. Whatever the cause, we don’t believe it is related to COVID-19 (at least not directly), since neither the Pearson correlations nor the Granger causality test shows any correlation.
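The comparison step can be sketched as below. This covers only the comparison of actual counts against a forecast interval, not the ARIMA fitting itself. It assumes a forecast mean and standard error are already available for each week and uses a normal-approximation 95% interval (`z=1.96`). The function name `flag_unusual_weeks` and all numbers are hypothetical.

```python
import numpy as np


def flag_unusual_weeks(actual, predicted, std_err, z=1.96):
    """Flag weeks whose actual count falls outside predicted +/- z * std_err."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    std_err = np.asarray(std_err, dtype=float)
    lower = predicted - z * std_err
    upper = predicted + z * std_err
    outside = (actual < lower) | (actual > upper)
    return outside, lower, upper


# Hypothetical weekly values for one cause of death (illustrative only).
actual = [610, 640, 700, 660]
predicted = [600, 605, 610, 615]   # placeholder for a model's forecast mean
std_err = [20, 20, 20, 20]         # placeholder for the forecast standard error

outside, lower, upper = flag_unusual_weeks(actual, predicted, std_err)

# Deaths above the forecast, summed over the flagged weeks only.
excess = float(np.sum((np.asarray(actual) - np.asarray(predicted))[outside]))

print("Weeks outside the interval:", outside.tolist())
print("Excess deaths in flagged weeks:", excess)
```

Weeks inside the interval are treated as consistent with the forecast, so only the flagged weeks count as unusual activity.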
Figure 8: Predicted Deaths vs Actual Deaths
Our map in Figure 3 showed that patterns emerge when evaluating by region. Initially we analyzed causes of death by percentage and number of deaths by state and subregion, but that quickly became unwieldy when using more than one year. In Figure 9, we limit the years to 2018-2020 to keep the display readable for comparison, while dropping some of the largely similar years between 2014 and 2019.
Evaluating causes of death by region and year gives us an additional view of how COVID-19 affects the different causes of death. Viewing all years reveals a relatively stable distribution of percentages among the causes of death between 2014 and 2019. However, as Figure 9 shows, we start to see a different picture when we include 2020. As expected, this is most notable in causes that already account for a high percentage of total deaths, such as heart disease and cancer. What is additionally interesting is that the change in percentage is not uniform across regions. For example, the change in percentage for heart disease and Alzheimer’s between 2019 and 2020 is much higher in the Northeast than in the other three regions, as are its COVID-19 deaths. And while unknown/other deaths increased across all regions, we see a very unusual jump in the West. The number of cancer deaths holds fairly stable across the years when viewed by region.
By hovering over each bar, you can see the cause of death, the percent of deaths for that cause, the total number of deaths for that cause, and the total number of deaths for all causes.
Figure 9: Causes of Death by Region and Year
*Does not include Puerto Rico
It may seem that Figure 7 shows causes of death going down while other plots show them going up; that is because the cause of death is expressed there as a percentage. To verify this, we can look at the example of heart disease in the South across the years. As you can see in Table 1 below (codTotal = cause of death total and YRDeathTotal = total deaths for the year for that region), overall deaths went up drastically for the same portion of the year (due to COVID-19 deaths), while heart disease deaths increased only slightly by comparison. However, heart disease does show some abnormal upticks, so deeper analysis with controls for population and additional years is needed.
Based on the analysis from the table above, we return to our original timeline plot, but expand it to all years and still facet by region. In Figure 9, we can click on each cause of death to choose which ones appear in the plots and then autoscale. This gives a clearer picture where lines would otherwise overlap and scales the y-axis for each individual cause.
By doing this we notice some interesting information. The first thing we notice is that from 2014-2015, there were frequent jumps in the counts for causes of death. We also see that the South, and sometimes the Northeast, has the most notable increases and decreases for causes of death in 2020. Influenza rises and falls regularly from year to year, so factors other than COVID-19 may play a role in its uptick in 2020. Unknown* deaths show an undeniable increase across all regions.
Figure 9: Count of Total by Cause of Death by Region
When we combine the regions, we get a clear picture of the overall increases and decreases in each cause of death. Excluding 2014, Alzheimer’s, cerebrovascular, heart disease, and kidney disease all showed varying amounts of increase, while cancer had a notable decrease. Influenza, lower respiratory, and septicemia are harder to assess, as they have outliers and/or dramatic shifts from year to year. Diabetes and unknown* causes of death show increases with or without the 2014 data. The category “Other respiratory” appeared unaffected by the 2014 data, and its increase in deaths does not seem out of line with previous years’ trends.
Figure 10: Count of Total by Cause of Death
As stated in the introduction, our project sought to find whether COVID-19 affected the number of deaths in the other 12 leading causes of death. To determine this, we used several different methods: total deaths by cause, the number of deaths by cause over time, the percentage of deaths by cause over time in stacked format, Pearson correlation, Granger causality, excess deaths over predicted deaths by cause, geospatial plots to see how deaths are regionally dispersed, and finally a second look at cause of death over time from a yearly perspective. While each plot explored different concepts, they were often in agreement that COVID-19 did affect the numbers for at least some of the remaining causes of death. In some cases, such as heart disease, Alzheimer’s, and diabetes, there was an increase across investigations. Cancer often showed a decrease.
Some of our plots showed inconclusive results or did not show results related to the net effect. These plots still provided valuable information that we based our subsequent investigations on. Our USA map is an excellent example of this as it helped establish areas with higher cases per capita to look into more closely. The Causes of Death over Time shows interesting patterns of each cause’s ebb-and-flow when the timeline is expanded.
Overall, we saw that COVID-19 had the most significant effect on the number of deaths from heart disease, diabetes, Alzheimer’s & dementia, cerebrovascular, and unknown causes. Cancer, kidney disease, other respiratory, and septicemia also showed that COVID-19 might play a role in increasing or decreasing death numbers in those categories. Table 4 below highlights the results observed in our analysis.
In Table 4, “x” represents a noted relationship, “I” stands for increase, and “D” stands for decrease. Values marked with an “*” mean that the increase or decrease appeared significant from observing the corresponding plot.
| Cause of Death | COD Over Time | COD % Over Time | Pearson | Granger | Predicted Deaths | North Central | Northeast | South | West | All Regions |
|---|---|---|---|---|---|---|---|---|---|---|
| Alzheimer’s | x | x | x | x | I* | I* | I* | I* | I* | I* |
| Cancer | x | D* | D | |||||||
| Cerebrovascular | x | x | x | I* | I | I* | I | I* | ||
| Diabetes | x | x | x | x | I* | I* | I* | I* | I* | |
| Heart Disease | x | x | x | x | I* | I* | I | I* | ||
| Influenza & Pneumonia | x | x | ||||||||
| Kidney Disease | x | x | I* | I | I | I | ||||
| Lower Respiratory | x | D | ||||||||
| Other Respiratory | x | x | I* | I | D | |||||
| Septicemia | x | x | x | D | I | |||||
| Unknown Causes | x | x | I* | I* | I* | I* | I* | I* |
A few key items to mention regarding our analysis:
Once we began our analysis, it became clear that there are a multitude of variables, potentially some not even recognized yet, that create an ever-changing landscape for analysis. That said, there are a few things we would consider to increase understanding of our findings:
It is clear from the data currently available that COVID-19 has played a role in increasing deaths for some causes of death while decreasing them for others. Timely information can be helpful in the short term for determining strategies - particularly for those in risk categories of the increased causes of death - but ultimately we will need more time to understand the full effects.